FEAT: Add ArabiziConverter for Arabic transliteration #1906
Conversation
FYI @Raulster24 technically we already have a character-level converter! Check
@romanlutz Makes sense, you are right. WordLevelConverter already covers this pattern (Leetspeak, Emoji), so a separate base class would duplicate it. I'll drop the CharacterSubstitutionConverter and keep the Arabic converters standalone as they are. Thanks for catching it.
romanlutz left a comment
I reran the notebook to produce outputs. Looks great!
Description
Adds `ArabiziConverter`, a deterministic `PromptConverter` that transliterates Arabic script into Arabizi (Latin-script "chat Arabic"), where letters with no Latin equivalent are written with shape-resembling digits (HAH -> 7, AIN -> 3, QAF -> 8). It applies a per-character mapping with Gulf-leaning conventions; no language model is involved, so the same input always produces the same output. Short-vowel diacritics and the tatweel connector are dropped, and non-Arabic text (Latin, digits, punctuation) is left unchanged. The mapping is intentionally lossy, mirroring how Arabizi is actually written.

The mapping follows the documented Arabic chat alphabet (Gulf-leaning where regional variants exist, e.g. QAF -> 8 with GHAIN -> gh to avoid the regional 8 collision). Feedback on specific letter choices is welcome.
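For reviewers who want to see the shape of the approach without opening the diff, here is a minimal standalone sketch of a per-character transliteration of this kind. It is not the converter's actual code: the function name `to_arabizi`, the table name `ARABIZI_MAP`, and every mapping entry other than HAH, AIN, QAF, and GHAIN are assumptions made for illustration, as is the exact diacritic range that gets dropped.

```python
# Illustrative sketch only; not the PR's implementation.
# Entries marked "from the description" come from the PR text; the rest
# are assumed here just to make the example runnable.
ARABIZI_MAP = {
    "\u062D": "7",   # HAH   -> 7  (from the description)
    "\u0639": "3",   # AIN   -> 3  (from the description)
    "\u0642": "8",   # QAF   -> 8  (from the description)
    "\u063A": "gh",  # GHAIN -> gh (from the description)
    "\u0627": "a",   # ALEF  (assumed)
    "\u0628": "b",   # BEH   (assumed)
    "\u0644": "l",   # LAM   (assumed)
    "\u0645": "m",   # MEEM  (assumed)
    "\u0634": "sh",  # SHEEN (assumed multi-character mapping)
}

# Characters removed outright: U+064B..U+0652 (assumed diacritic range)
# plus the tatweel connector U+0640.
DROPPED = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0640"}


def to_arabizi(text: str) -> str:
    """Transliterate character by character using a fixed lookup table."""
    out = []
    for ch in text:
        if ch in DROPPED:
            continue  # diacritics and tatweel are dropped
        # Anything without a table entry (Latin, digits, punctuation,
        # or unmapped letters in this reduced table) passes through.
        out.append(ARABIZI_MAP.get(ch, ch))
    return "".join(out)
```

Because the function is a pure table lookup with no randomness or model call, repeated calls on the same input return the same string, which is the determinism property described above. The mapping is many-to-one once diacritics are removed, so it cannot be inverted in general.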
Fourth in the set of atomic Arabic-script converters, following BidiConverter (#1832), TatweelConverter (#1869), and ArabicPresentationFormConverter (#1888). It can later migrate to a shared CharacterSubstitutionConverter base alongside UnicodeConfusableConverter.
cc @romanlutz
Tests and Documentation
- `tests/unit/prompt_converter/test_arabizi_converter.py`: word transliteration, number-letters, multi-character mappings, dropped diacritics/tatweel, non-Arabic passthrough, mixed text, empty input, determinism, and unsupported-input-type rejection. All pass: `uv run pytest tests/unit/prompt_converter/test_arabizi_converter.py`
- `pyrit/prompt_converter/__init__.py` (import + `__all__`).
- `doc/code/converters/1_text_to_text_converters.py` and regenerated the paired `.ipynb` plus the converter modality table in `0_converters.ipynb` via JupyText.
- `ruff` and `ty` are clean; the converter-documentation conformance test passes.